NSF PAR Search | NSF Public Access Repository

The Metagenomic Binning Problem: Clustering Markov Sequences

https://doi.org/10.1109/TMBMC.2023.3336254

Greenberg, Grant; Shomorony, Ilan (March 2024, IEEE Transactions on Molecular, Biological, and Multi-Scale Communications)

Full Text Available
LexicHash: sequence similarity estimation via lexicographic comparison of hashes

https://doi.org/10.1093/bioinformatics/btad652

Greenberg, Grant; Ravi, Aditya Narayan; Shomorony, Ilan (November 2023, Bioinformatics)
Alkan, Can (Ed.)
Abstract
Motivation: Pairwise sequence alignment is a heavy computational burden, particularly in the context of third-generation sequencing technologies. This issue is commonly addressed by approximately estimating sequence similarities using a hash-based method such as MinHash. In MinHash, all k-mers in a read are hashed and the minimum hash value, the min-hash, is stored. Pairwise similarities can then be estimated by counting the number of min-hash matches between a pair of reads, across many distinct hash functions. The choice of the parameter k controls an important tradeoff in the task of identifying alignments: larger k-values give greater confidence in the identification of alignments (high precision) but can lead to many missing alignments (low recall), particularly in the presence of significant noise.
Results: In this work, we introduce LexicHash, a new similarity estimation method that is effectively independent of the choice of k and attains the high precision of large-k and the high sensitivity of small-k MinHash. LexicHash is a variant of MinHash with a carefully designed hash function. When estimating the similarity between two reads, instead of simply checking whether min-hashes match (as in standard MinHash), one checks how “lexicographically similar” the LexicHash min-hashes are. In our experiments on 40 PacBio datasets, the area under the precision–recall curves obtained by LexicHash had an average improvement of 20.9% over MinHash. Additionally, the LexicHash framework lends itself naturally to an efficient search of the largest alignments, yielding an O(n) time algorithm, and circumventing the seemingly fundamental O(n^2) scaling associated with pairwise similarity search.
Availability and implementation: LexicHash is available on GitHub at https://github.com/gcgreenberg/LexicHash.
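The contrast between counting exact min-hash matches and scoring lexicographic similarity can be illustrated with a minimal Python sketch. It is not the authors' implementation: the mask-based hash (XOR of 2-bit base codes with a random mask), the use of the longest shared prefix as the similarity score, and all function names (`make_masks`, `sketch`, `minhash_score`, `lexicographic_score`) are assumptions made for illustration. Reads are assumed to contain only A, C, G, T and to be at least k bases long.

```python
import random

# 2-bit encoding of bases (assumption: reads contain only A, C, G, T).
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmers(read, k):
    """Return all overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def make_masks(num_hashes, k, seed=0):
    """One random length-k mask per hash function (hypothetical construction)."""
    rng = random.Random(seed)
    return [[rng.randrange(4) for _ in range(k)] for _ in range(num_hashes)]


def masked(kmer, mask):
    """Hash a k-mer by XOR-ing each base code with the mask digit.

    The result is a tuple, so Python compares it lexicographically.
    """
    return tuple(ENCODE[base] ^ m for base, m in zip(kmer, mask))


def sketch(read, masks, k):
    """Store the minimum hash value of the read under each hash function."""
    return [min(masked(km, mask) for km in kmers(read, k)) for mask in masks]


def minhash_score(sketch_a, sketch_b):
    """MinHash-style score: number of hash functions whose min-hashes match."""
    return sum(a == b for a, b in zip(sketch_a, sketch_b))


def prefix_len(a, b):
    """Length of the longest common prefix of two hash values."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lexicographic_score(sketch_a, sketch_b):
    """Assumed lexicographic score: longest shared min-hash prefix over all hashes."""
    return max(prefix_len(a, b) for a, b in zip(sketch_a, sketch_b))
```

In this sketch, a partial agreement between two min-hashes still contributes to `lexicographic_score`, whereas `minhash_score` only counts hash values that agree on all k positions.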
Full Text Available
CVQVAE: A representation learning based method for multi-omics single cell data integration

Liu, Tianyu; Greenberg, Grant; Shomorony, Ilan (November 2022, Proceedings of Machine Learning Research)

Full Text Available
Improving bacterial genome assembly using a test of strand orientation

https://doi.org/10.1093/bioinformatics/btac516

Greenberg, Grant; Shomorony, Ilan (September 2022, Bioinformatics)

Abstract
Summary: The complexity of genome assembly is due in large part to the presence of repeats. In particular, large reverse-complemented repeats can lead to incorrect inversions of large segments of the genome. To detect and correct such inversions in finished bacterial genomes, we propose a statistical test based on tetranucleotide frequency (TNF), which determines whether two segments from the same genome are of the same or opposite orientation. In most cases, the test neatly partitions the genome into two segments of roughly equal length with seemingly opposite orientations. This corresponds to the segments between the DNA replication origin and terminus, which were previously known to have distinct nucleotide compositions. We show that, in several cases where this balanced partition is not observed, the test identifies a potential inverted misassembly, which is validated by the presence of a reverse-complemented repeat at the boundaries of the inversion. After inverting the sequence between the repeat, the balance of the misassembled genome is restored. Our method identifies 31 potential misassemblies in the NCBI database, several of which are further supported by a reassembly of the read data.
Availability and implementation: A GitHub repository is available at https://github.com/gcgreenberg/Oriented-TNF.git.
Supplementary information: Supplementary data are available at Bioinformatics online.
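The idea of comparing the orientation of two segments through tetranucleotide frequencies can be sketched in Python as follows. This is a simplified heuristic, not the statistical test proposed in the paper: it assumes that a segment in the same orientation has a TNF vector closer (in L1 distance) to the reference segment than its reverse complement does. The function names (`tnf`, `same_orientation`) and the choice of L1 distance are illustrative assumptions.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
# All 256 tetranucleotides in a fixed order, so TNF vectors are comparable.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def tnf(seq):
    """Normalized tetranucleotide frequency vector of a sequence.

    Tetramers containing characters other than A, C, G, T are ignored.
    """
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        raise ValueError("sequence contains no valid tetranucleotides")
    return [counts[t] / total for t in TETRAMERS]


def l1_distance(p, q):
    """L1 distance between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def same_orientation(seg_a, seg_b):
    """Heuristic (assumption): seg_b shares seg_a's orientation if its TNF
    is at least as close to seg_a's TNF as the TNF of its reverse complement."""
    freq_a = tnf(seg_a)
    forward = l1_distance(freq_a, tnf(seg_b))
    reverse = l1_distance(freq_a, tnf(reverse_complement(seg_b)))
    return forward <= reverse
```

Applied to consecutive windows of a finished genome, a decision rule of this kind would label each window relative to a reference window; the paper's actual test and its thresholds should be taken from the linked repository.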
Full Text Available
